1. INTRODUCTION

The exponential rise of internet communities, digital assistants, sensing devices, and the internet of
things (IoT) has been accompanied by a major increase in data, which has given rise to big data.
According to Gartner research from 2017, the volume of data was projected to increase dramatically from
17.3 billion in 2016 to 21.9 billion in 2021. Compared with conventional data, big data refers to excessive
data growth in heterogeneous formats. It is so large and complex that standard data processing tools
cannot cope with big data applications [1], [2].

The task becomes challenging as data volume, variety, and processing demands grow, and many
techniques have come into existence to deal with such an environment. Big data analytics studies massive
and diverse data sets from their sources to identify information such as hidden patterns and unknown
relationships, so that better choices can be made. Storing and processing such data requires a scalable
architecture, because the data is too large to be processed with simple manual computational methods.
Data processing must therefore take volume, velocity, and variety into account [3]-[5].


Journal homepage: http://ijeecs.iaescore.com 


Indonesian J Elec Eng & Comp Sci ISSN: 2502-4752


The analysis of big data rests on three techniques: i) statistical algorithms, used to predict outcomes
and build models; ii) data mining, used to uncover data patterns and their correlations; and iii) machine
learning (ML), used to handle the complexity of new models and new data.

Text analysis and speech recognition are used to examine free-form text and spoken language.
Figure 1 illustrates how big data analytics tools are grouped into four types to give detailed information
related to the field. The types are:

- Descriptive analytics answers "What happened?" from historical data. It helps find data trends and is
used in business intelligence. Results are visualized through pie charts, tables, and graphs.

- Diagnostic analytics addresses "Why did it happen?" using drill-down, data discovery, data mining, and
correlations. Data mining is used to extract information from unstructured data, and the search for
systematic relationships in the data helps determine the goal.

- Predictive analytics tells an organization "What is likely to happen?" It builds on the two models
above, using data about what has already transpired together with regression analysis and pattern
matching. This work requires knowledge of statistics and programming.

- Prescriptive analytics tells us "What to do?" and is the most advanced level. It draws on graph
analysis, simulation, complex event processing, neural networks, and machine learning. A sound data
architecture and implementation are needed for good data quality.


[Figure 1: diagram of the four types of data analytics: descriptive, diagnostic, predictive, and prescriptive]


Figure 1. Types of data analytics [6] 


Figure 2 shows the primary domains where data analysis is needed and where big data analytics can
be employed, including healthcare, transport, e-commerce, banking and finance, and manufacturing. All
of these fields have advanced greatly due to such analytics. Predictive big data analytics (PBA) is a
data-driven technology that helps analyze large-scale data with hidden patterns, uncover opportunities,
and predict outcomes. PBA uses machine learning techniques to forecast future occurrences by analyzing
current and historical data. In predictive analytics, machine learning is an intelligent business tool that
helps extract useful insights from enormous datasets. Traditional methodologies confront significant
hurdles and become computationally impractical when applied to huge data: as the data size expands,
moving the data across the system's processing units limits the algorithm's speed. Statistically efficient
machine learning methods are therefore necessary to cope with large data while requiring minimal
resources such as memory [7]-[10].

The PBA system raises several issues. It helps extract a large amount of knowledge from huge sample
sizes, but problems arise when high dimensionality is combined with high computational cost and
algorithmic instability; these are considered the drawbacks of the PBA system [4]. One proposed remedy
is to increase the size of the data, which can help deliver more efficient performance. Many efforts are
being made to overcome these shortcomings, to identify accurate solutions to such problems, and to
reduce the dimensions of datasets while preserving the value of the data. One operation performed on big
data is to decrease the number of features in a dataset while retaining the main variables, using a
dimension reduction technique. When dimension reduction techniques are used, the data is reduced first
and then fed to the machine learning model. Overcoming these challenges is a significant exercise that
the PBA system must undertake. Effective predictive analytics requires fast model design and the
development of reliable prediction models. To better understand high-dimensional big data, researchers'
main goal is to develop a better PBA system. A machine learning regression algorithm is the fundamental
strategy that supports correct decision-making. In the PBA system, splitting random forest (SRF)
regression is the machine learning algorithm used.
Figure 2. Applications of data analytics [11]


This study addresses two major issues in the PBA system. First, in SRF, parameterization is
strongly connected with prediction scores, but good performance is confined to a narrow range of ideal
parameters. Selecting an ideal set of hyperparameters therefore requires an efficient model selection
procedure. SRF [4], [12], [13] prediction models are built to capture behavior and trends in
high-dimensional large data, and SRF hyperparameters are optimized to obtain adequate generalization
performance. The suggested method tracks all hyperparameter combinations in SRF together with their
associated prediction scores and makes a trade-off between predictive power and computing time [5],
[14], [15].

Second, emerging data comprise a huge number of complicated, high-dimensional features. Some
features have a greater impact on predictive power than others, and features that have no bearing on
prediction accuracy must be removed. Applying dimension reduction [16] when developing random forest
(RF) models decreases processing time overhead. To assess the system's performance, two dimension
reduction approaches, principal component analysis (PCA) and information gain (IG), can be studied.
The nature of the system's data is then identified, and the more advantageous approach is chosen.

The main goal of the big data approach is to handle data in a way that is valuable to the enterprise.
Otherwise, the expense of storing and maintaining data outweighs the value of processing it. The most
difficult aspects of big data analysis are processing the data successfully to obtain useful information
and using the processed data for decision-making. Many technologies are available on the market for
analyzing and managing big data, and practical, effective tools must be selected for the research
endeavor. Big data is now defined by the 5V model, which extends the fundamental volume, velocity, and
variety (3V) paradigm with value and veracity, the two dimensions connected to data quality. Data
storage, data processing, and data quality are all relevant, but the main concerns lie in the privacy,
security, and scalability of data.


2. LITERATURE REVIEW 

PBA systems are concerned with data complexity, variability, privacy, portability, and volume.
Furthermore, high-dimensional data management and computational optimization have recently attracted
growing attention. This paper examines a PBA system for high-dimensional data and prediction.
Researchers have described various developments in previous work for boosting PBA system execution,
presenting the system's design and prediction performance. Many traditional analytics approaches have
difficulty learning from large-scale, complicated, and varied data. A predictive analytics system was
applied to data from the metalwork sector to overcome the large-scale data analytics challenge [4], [17],
[18]. To anticipate long-term power consumption, the authors employed a backpropagation neural
network. Their approach found useful patterns in data with unknown correlations, but the learning
parameters must be maintained so that the prediction score does not suffer. To recognize extreme
occurrences, Shenoy and Gorinevsky [19] used a nonparametric Bayesian formulation for the PBA system.
Martino et al. [20] and Zhang et al. [21] used the PCA methodology for ensembling and learning with
Apache Spark. That research compared the performance of various big data machine learning models, but
the data sets were large and had a wide variety of features, so the analysis was not executed properly.

A Spark-based parallel random forest (PRF) technique was introduced by Chen et al. [22]. They
used a hybrid strategy to improve PRF and overcome the challenge of high-dimensional data. However,
depending on the type of dataset, PCA and IG techniques significantly affect the PBA system's
prediction performance during dimension reduction [7], [22], [23]. It is therefore necessary to
investigate the performance of various dimensionality reduction approaches for the PBA system.

Along with commercial well-being, the digital age brings challenges and concerns. As the comparison
in Table 1 shows, the dimensions of massive data influence the machine learning efficiency of the PBA
system. High-dimensional data are difficult to handle with existing well-known learning algorithms. A
systematic PBA system based on SRF is proposed [8], [24]. To handle complicated problems,
high-dimensional data must be brought into a presentable form, so the proposed PBA system enhances the
tree-based approach with hyperparameter optimization and dimension reduction. In the proposed model,
data enter from a single source fed by various means such as social media, mobile apps, sensing devices,
and IoT. The raw data then pass through the data pre-processing unit, where cleaning, generation, and
selection are performed according to the type of data. The data are next sent to model generation,
undergo hyperparameter optimization and dimension reduction, and in the final stage the predictions are
made.


Table 1. Proposed predictive big data analytics model 


Author [Ref.]               Technique and methodology used              Challenges faced in research
Hernandez and Zhang [1]     Conventional analysis technique.            Needs to perform for a large scale.
Shenoy and Gorinevsky [19]  Bayesian formulation for PBA system.        Solves the problem of excellent performance but
                                                                        needs to tackle PBA system with high
                                                                        dimensional data.
Zhang and Yang [21]         PCA based dimension reduction technique     Focused on performance results with accuracy
                            to reduce the dimension of big data.        and processing time.
Shin et al. [25]            Predictive analysis used for metal cutting  Parameters were not managed because of
                            for back propagation neural networks.       unknown data correlation.
Ntaliakouras et al. [26]    Decision tree algorithm with pre spark.     Accuracy was not maintained for forecasting of
                                                                        tourism demand.
Lakshmipadmaja et al. [27]  Random subset feature selection (RSFS).     Needs advancement in RSFS for handling high
                                                                        dimensional data.


Data storage receives massive volumes of numeric data and links with computer server nodes for
quick processing. The Hadoop distributed file system (HDFS) is used to offer fault-tolerant, manageable
storage. When HDFS receives large amounts of data, it divides the data into discrete chunks and
distributes them across multiple computer nodes in a cluster. Data storage is intended to be
cost-effective and scalable, and it has been deliberately designed to be highly fault-tolerant:
replicated data is sent and distributed to the servers, so data lost on a crashed node can be found on
other nodes in the cluster. Processing continues while the data is being retrieved [9]. The data
processing unit is the most important component of the suggested system, providing the computing
infrastructure needed to mine and analyze enormous amounts of data in a timely and effective way.
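
The chunking and replication behavior described above can be illustrated with a toy simulation. This is a minimal sketch, not HDFS code: the function names, the tiny block size, and the round-robin placement are assumptions made for illustration only (HDFS itself uses much larger blocks and rack-aware replica placement).

```python
from typing import Dict, List, Set


def split_into_blocks(data: bytes, block_size: int) -> List[bytes]:
    """Cut a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_replicas(num_blocks: int, nodes: List[str], replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct nodes, round-robin (illustrative policy)."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    return {
        block_id: [nodes[(block_id + r) % len(nodes)] for r in range(replication)]
        for block_id in range(num_blocks)
    }


def surviving_blocks(placement: Dict[int, List[str]], failed_node: str) -> Set[int]:
    """Blocks that still have at least one replica after `failed_node` is lost."""
    return {b for b, holders in placement.items() if any(n != failed_node for n in holders)}


# Hypothetical cluster: three data nodes, two replicas per block.
blocks = split_into_blocks(b"x" * 1000, block_size=256)
placement = place_replicas(len(blocks), ["n1", "n2", "n3"], replication=2)
print(placement)
print(surviving_blocks(placement, "n2") == set(placement))
```

With two replicas on distinct nodes, the loss of any single node leaves every block readable, which is the property the paragraph relies on.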

Data analytics is a critical component of the suggested approach for achieving high predictive
accuracy. It is divided into two stages: data pre-processing and prediction model construction.
Because big data can contain inaccurate and redundant data, a data cleaning phase is conducted to
eliminate or minimize noise using smoothing techniques. It eliminates outliers and corrects
irregularities with missing-value treatments. Standard normalization is used for data processing and
reduction, which contributes to a more intelligible pattern for prediction. SRF can be employed in the
prediction model building phase to enable correct decision-making for the developed framework. The SRF
predictor is a well-known predictor for the PBA system [10].
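
The cleaning steps named above can be sketched on a single numeric column. The text does not specify the exact treatments, so the choices here (mean imputation, a z-score cutoff for outliers, and z-score standardization) and all function names are assumptions for illustration.

```python
import statistics


def impute_missing(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def drop_outliers(values, z_threshold=3.0):
    """Keep only values within z_threshold standard deviations of the mean."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return list(values)
    return [v for v in values if abs(v - mean) / std <= z_threshold]


def standardize(values):
    """Rescale to zero mean and unit variance (z-score normalization)."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Hypothetical raw column with one missing entry.
raw = [1.0, None, 3.0]
cleaned = standardize(drop_outliers(impute_missing(raw)))
print(cleaned)
```

In a real pipeline each step would run per feature column, and the statistics would be computed on the training split only.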
3. HYPER-PARAMETERS OPTIMIZATION

Hyperparameters are tunable parameters that must be set carefully for a model to perform well, and
the reliability of the model depends on them. The process of finding the best hyperparameter values is
called hyperparameter optimization. Every machine learning system has hyperparameters, and the most
fundamental goal of automated machine learning (AutoML) is to adjust them automatically to improve
performance. Recent deep neural networks, in particular, rely heavily on a wide variety of
hyperparameter choices for the network's construction, regularization, and optimization. To analyze
massive data efficiently, SRF needs well-chosen hyperparameter values [14], [28].

For some predictions, the default values cannot be used because they do not fulfil the requirements
of the data sets. SRF can be improved by optimizing these parameters, since their settings can be tuned
before training for maximum performance. Taking SRF as the predictive tool when optimizing
hyperparameters reduces the amount of work required. Many hyperparameter combinations are conceivable,
and an individual SRF prediction model is created for each combination. SRF builds a forest of trees
governed by two hyperparameters: the number of trees and the maximum depth of each tree. The number of
trees (NumTrees) governs the computation cost and the performance of the prediction model [29], [30].

The maximum depth controls how deep each tree can grow, and training time increases exponentially
with it. Fundamentally, the model building procedure uses a greedy method to discover the optimal
combination of hyperparameters and discard inferior models. All possible hyperparameter combinations
can be generated and run to pick the ideal parameters for the best model on each dataset. To develop the
decision model, we can employ the RF model from Spark MLlib together with the relevant data pipelines.
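
The idea of running every hyperparameter combination and recording both its score and its cost can be sketched as follows. This is an exhaustive enumeration under stated assumptions, not the authors' implementation: `evaluate` is a hypothetical caller-supplied function that would train a forest and return a validation error, and `toy_error` is a stand-in so the sketch runs without a cluster.

```python
import itertools
import time


def grid_search(evaluate, num_trees_grid, max_depth_grid):
    """Try every (num_trees, max_depth) pair; record its error and wall-clock time.

    `evaluate(num_trees, max_depth)` must return a validation error (lower is better).
    """
    results = []
    for num_trees, max_depth in itertools.product(num_trees_grid, max_depth_grid):
        start = time.perf_counter()
        error = evaluate(num_trees, max_depth)
        elapsed = time.perf_counter() - start
        results.append({"num_trees": num_trees, "max_depth": max_depth,
                        "error": error, "seconds": elapsed})
    # Lowest error wins; ties go to the configuration that ran faster.
    best = min(results, key=lambda r: (r["error"], r["seconds"]))
    return best, results


def toy_error(num_trees, max_depth):
    # Stand-in for real training: pretend the error is minimized at 64 trees, depth 8.
    return abs(num_trees - 64) / 64 + abs(max_depth - 8) / 8


best, results = grid_search(toy_error, [2 ** j for j in range(1, 12)], [2, 4, 8, 16])
print(best["num_trees"], best["max_depth"])
```

Keeping the full `results` list, rather than only the winner, is what makes the trade-off between predictive power and computing time visible afterwards.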


3.1. The number of trees in the SRF 

The number of trees governs the computational cost and the performance of the prediction model.
Using either too many or too few trees may cause issues, so choosing the appropriate number is tricky.
Splitting in SRF refers to the random number of features considered for each tree. In this study the
number of trees is varied exponentially in base two, i.e., $L = 2^{j}$, $j = 1, 2, \ldots, 11$, and an SRF
is built and tested for each value.
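
Reading the schedule as powers of two (an assumption based on the phrase "exponential rates in base two"), the candidate forest sizes can be listed directly; the variable name `tree_counts` is illustrative.

```python
# Candidate forest sizes L = 2^j for j = 1, ..., 11.
tree_counts = [2 ** j for j in range(1, 12)]
print(tree_counts)
# [2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]
```

Doubling the forest size at each step covers three orders of magnitude with only eleven trained models.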


3.2. SRF maximum depth 

The maximum depth bounds the depth of each tree in the forest, and the running time grows
exponentially with tree depth. Limiting the depth decreases the complexity of the learned models and the
risk of overfitting, which occurs when trees are too deep. RF, on the other hand, mitigates this
problem, and a proper tree depth can give good error reduction.
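
The exponential growth can be made concrete with a standard counting bound. As a sketch, assuming binary splits (the symbol $d$ for the depth limit is introduced here for illustration), a tree of depth $d$ satisfies:

```latex
\[
  \text{leaves} \le 2^{d}, \qquad
  \text{nodes} \le \sum_{i=0}^{d} 2^{i} = 2^{d+1} - 1 .
\]
```

Raising the depth limit by one can therefore roughly double the number of splits to evaluate, although trees grown on real data usually stop well short of this bound.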


3.3. Dimension reduction 

The dimension of a dataset refers to the number of features it contains. To eliminate processing
time overhead during model construction, features that do not affect the model are identified and
removed using dimension reduction techniques. Reducing data dimensionality has become a critical
problem for effective analytics in a distributed context. The combination of hyperparameter tuning and
dimension reduction can greatly improve the model's prediction performance [13], [31], [32]. A variety
of machine learning approaches already take feature significance into account. This study compares two
common feature reduction approaches, principal component analysis and information gain, to validate the
efficiency of dimension reduction strategies. It provides a predictive analysis method for
high-dimensional big data based on the enhancing scalable random forest (ESRF), which is used to analyze
such data and to further increase classification accuracy and stability. To avoid processing time
overhead in the model creation stage, the system adopts a greedy technique to determine the optimal
hyperparameter combination of SRF, which helps predict trends and behavioral patterns from
high-dimensional big data, and reduces the data sets using PCA and IG [33], [34]. The experimental
findings suggest that the PBA system described in this study can achieve high predictive ability and
effective performance with the shortest processing time on the complete experimental data set.


3.3.1. Dimension reduction with PCA 


This approach projects the data into a lower-dimensional subspace that retains most of the variation,
reducing the data's high dimensionality. PCA finds k dimensions and represents the data in a new set of
variables known as principal components [35]. The first component captures the largest possible variance
across all data points and is therefore the most important.
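
How the first principal component is obtained can be sketched in plain Python. This is an illustrative version using power iteration on the sample covariance matrix, not the Spark routine used in the study; the function names are assumptions, and the fixed starting vector would fail in the rare case that it is orthogonal to the leading direction.

```python
import math


def first_principal_component(rows, iterations=200):
    """Return (unit direction, variance along it) for the first principal component.

    rows: list of equal-length lists of floats, one row per sample.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    variance = sum(v[a] * sum(cov[a][b] * v[b] for b in range(d)) for a in range(d))
    return v, variance


def project(rows, direction):
    """Coordinate of each mean-centered row along `direction`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    return [sum((r[j] - means[j]) * direction[j] for j in range(d)) for r in rows]


# Hypothetical 2-D data lying exactly on the line y = 2x.
data = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
direction, variance = first_principal_component(data)
scores = project(data, direction)
print(direction, variance)
```

Here the two original columns collapse into one coordinate per sample without losing any variance, because the toy data are perfectly collinear.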


3.3.2. Dimension reduction using IG 

IG is a feature selection approach that helps reduce a dataset's dimensions by determining the
relevance of the original features. The primary idea is to rank the feature variables by computing the
gain ratio of each one. The topmost feature variable becomes the principal variable, and the remaining
ones are chosen from the other variables in rank order. The highest-ranked features can thus be selected
simply using IG. To avoid overfitting, the value of the highest-ranked feature variable is replaced with
its gain ratio value [33].
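
The ranking step can be sketched for discrete features and labels. The helper names are illustrative, and continuous features or regression targets would first need to be discretized, which is not shown and is an assumption about how the method would be applied.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(feature, labels):
    """Reduction in label entropy obtained by splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def gain_ratio(feature, labels):
    """Information gain normalized by the entropy of the feature itself."""
    split_info = entropy(feature)
    return information_gain(feature, labels) / split_info if split_info > 0 else 0.0


def rank_features(columns, labels):
    """Sort feature names by gain ratio, highest first. columns: name -> list of values."""
    return sorted(columns, key=lambda name: gain_ratio(columns[name], labels), reverse=True)


# Hypothetical data: feature "a" determines the label, feature "b" is unrelated to it.
labels = [0, 0, 1, 1]
columns = {"b": [0, 1, 0, 1], "a": [0, 0, 1, 1]}
ranking = rank_features(columns, labels)
print(ranking)
```

Dimension reduction then amounts to keeping the first k names in `ranking` and dropping the rest.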


4. RESULTS AND ANALYSIS 

Previous work showed that PBA systems face many challenges in data complexity, heterogeneity,
privacy, and the maintenance of large data volumes, along with problems in arranging and managing
high-dimensional data and in data computation. The proposed PBA system can deal with high-dimensional
data and support predictive analysis. It can be examined on a big data analytics platform with one
master node and three other nodes, using any processor with 8 GB of memory per node. Five real-world
datasets can be taken for examination, such as a credit card dataset from the UCI repository of machine
learning databases and parallel workloads archive (PWA) logs such as that of the high-performance
computing center north (HPC2N).

Mean absolute error (MAE) and root mean square error (RMSE) are proposed as the evaluation metrics
for prediction accuracy; MAE results are given in Table 2. After hyperparameter optimization, MAE
results may be better than those obtained with the default parameters. For example, if the MAE with
default parameters for the credit card data is around 0.1019, then after hyperparameter optimization it
can be around 0.0017.
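
For reference, the two metrics can be computed as follows. This is a minimal sketch using the standard definitions; the function names and the sample values are illustrative and are not taken from the experiments.

```python
import math


def mean_absolute_error(actual, predicted):
    """Average magnitude of the prediction errors."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def root_mean_square_error(actual, predicted):
    """Square root of the average squared error; penalizes large errors more than MAE."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


# Hypothetical values: the errors are 1, 0, and 2.
actual = [3.0, 5.0, 2.0]
predicted = [2.0, 5.0, 4.0]
mae = mean_absolute_error(actual, predicted)
rmse = root_mean_square_error(actual, predicted)
print(mae, rmse)
```

RMSE is never smaller than MAE on the same predictions, so a large gap between the two signals a few large errors rather than many small ones.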


Table 2. Mean absolute error (MAE) comparison table for ESRF with dimension reduction


Datasets                                  ESRF     ESRF with DR_PCA   ESRF with DR_IG
Database autonomy service (DAS)           2.1300   1.0899             1.9999
Susy                                      0.3400   0.3399             0.3199
High performance computing (HPC)          1.0890   1.0299             0.8199
Knowledge discovery in databases (KDD)    0.7876   0.7855             0.6699
Credit-Card                               0.1019   0.1030             0.0022


Excessive data dimension affects the effectiveness of the PBA system, so reduction techniques can
be applied to reduce computational time and obtain the lowest error values for the datasets, as in
Table 3. The lowest MAE and RMSE values of ESRF with DR_PCA for the DAS dataset are 1.0907 and 2.0950,
as shown in Figure 3; the maximum values are 2.256 and 3.0465. If the MAE for the DAS dataset is 2.1300,
then after applying the dimension reduction technique it can drop to around 1.0899. This shows how
researchers are trying to improve the accuracy of the PBA system by using different techniques and
methods.


Table 3. RMSE comparison of ESRF with dimension reduction 
DATASETS       ESRF     ESRF WITH DR_PCA   ESRF WITH DR_IG
DAS            3.7995   2.0950             3.0879
SUSY           0.3940   0.3989             0.3199
HPC            1.5467   1.0650             0.8939
KDD            0.7779   0.7850             0.6699
CREDIT-CARD    0.1019   0.0350             0.0249


In this paper, dimension reduction of high-dimensional big data is proposed for an effective PBA
system that uses the SRF model built on decision trees. Figure 4 illustrates the trade-off between the
datasets and processing time. Accuracy and efficiency must both be maintained in the system's design
and implementation. In the DAS dataset, RF spends 129 seconds on training, whereas ESRF spends 71
seconds, to predict the number of processors for upcoming workload traces.

In the HPC2N dataset, ESRF forecasts the processors required to distribute resources efficiently
in 64 seconds, whereas RF requires 121 seconds. In the Susy dataset, RF takes 139 seconds to
anticipate the signal process, but ESRF takes just 76 seconds. In the credit-card dataset, ESRF can
identify whether a card is counterfeit or real in 100 seconds, whereas RF takes 154 seconds. In the KDD
dataset, ESRF also outperforms RF in terms of processing speed. The suggested PBA system employs ESRF
to transform a vast amount of disparate data into timely insights for speedier decision making.
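
The reported times can be turned into speedup factors with simple arithmetic. The sketch below uses only the RF and ESRF timings quoted in the text; KDD is omitted because no timings are given for it, and the variable names are illustrative.

```python
# Processing times in seconds as quoted in the text: (RF, ESRF).
reported_seconds = {
    "DAS": (129, 71),
    "HPC2N": (121, 64),
    "Susy": (139, 76),
    "Credit-Card": (154, 100),
}

# For each dataset: (speedup factor, percentage of RF time saved by ESRF).
summary = {
    name: (rf / esrf, 100 * (rf - esrf) / rf)
    for name, (rf, esrf) in reported_seconds.items()
}

for name, (speedup, saved) in summary.items():
    print(f"{name}: speedup {speedup:.2f}x, time saved {saved:.1f}%")
```

On these figures ESRF is between roughly 1.5 and 1.9 times faster than RF, with the smallest gain on the credit-card dataset.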
5. CONCLUSION

Data scientists' primary focus is on the growth of high-dimensional big data from numerous data
sources. Predictive big data analytics is crucial for extracting business knowledge from this data and
forecasting results, so it is critical to develop an effective and scalable PBA system capable of
handling high-dimensional large data. This study suggests a PBA system based on ESRF to deal with large
amounts of high-dimensional data. It can also help reduce processing time in model construction. ESRF
with the DR_PCA and DR_IG approaches uses dimension reduction to remove irrelevant feature variables
from datasets. In the DAS dataset, the ESRF prediction models with DR_PCA achieve strong prediction
scores (MAE 1.091 and RMSE 2.095) and reduce execution time from 129 to 69 seconds. In the credit-card
dataset, the suggested PBA system may deliver good prediction performance while minimizing processing
time. In summary, an appropriate technique for determining optimum hyperparameters can be established,
and the two most commonly used dimension reduction approaches can be applied to the PBA system. Given
the nature of predictive analytics data, information gain can be used to advantage for dimension
reduction in the suggested system. The key conclusion of this work is that optimizing hyperparameters
in SRF in conjunction with reduction techniques may considerably improve the system's prediction
performance, allowing the proposed PBA system to deliver more accurate outcomes.